Module 5: Data Manipulation and Analysis II

Jacob Jameson
Summer 2021

Functions for Data Manipulation

dplyr is based on the concepts of functions as verbs that manipulate data frames. It is part of tidyverse.

Single data frame functions / verbs:

  • filter: pick rows matching criteria
  • slice: pick rows using index(es)
  • select: pick columns by name
  • rename: rename specific columns
  • arrange: reorder rows
  • mutate: add new variables
  • transmute: create new data frame with variables

More verbs

  • bind_cols, bind_rows: bind multiple data frames by row and column
  • group_by: create groups of rows according to a condition
  • summarise: apply computations across groups of rows
  • gather: make wide data longer
  • spread: make long data wider
  • map: apply computations across columns, related map_dbl, map_dfr
  • pmap: apply computations across rows, related pmap_dbl, pmap_dfr
  • *_join where * = inner, left, right, or full: join two data frames together according to common values in certain columns, and * indicates how many rows to keep

Let's load our data!

Our data comes from the National Health and Nutrition Survey.

library(tidyverse)
library(haven)
nhanes <- read_dta('nhanes_clean.dta')
names(nhanes)
 [1] "id"            "age"           "sex"           "race"         
 [5] "height"        "weight"        "sbp"           "dbp"          
 [9] "cholesterol"   "triglycerides" "hdl"           "healthstatus" 
[13] "heartattack"   "diabetes"      "finalwgt"      "strata"       
[17] "psu"           "today"         "dob"           "bmi"          
[21] "HighBP"       

Pause this video and take some time to further explore this data!

Filtering based on multiple conditions

You have see examples of filter() that return rows based on a single condition, but the filter option also allows AND and OR style filters:

  • filter(condition1, condition2) will return rows where both conditions are met.
  • filter(condition1, !condition2) will return all rows where condition one is true but condition 2 is not.
  • filter(condition1 | condition2) will return rows where condition 1 and/or condition 2 is met.
  • filter(xor(condition1, condition2) will return all rows where only one of the conditions is met, and not when both conditions are met.

Combining multiple conditions

Multiple AND, OR and NOT conditions can be combined. The sample code will return all age 50 respondents that either have a weight above 113kg (250lbs) or a bmi (body mass index) above 30.

nhanes %>%
  filter(age == 50, (weight > 113 | bmi > 30))

Selecting columns based pre-identified columns

There is another option which avoids the continuous retyping of columns names: one_of(). You can set up column names upfront, and then refer to them inside a select() statement by either wrapping them inside one_of() or by using the !! operator. This will also allow you to re-order columns.

blood_related <- c("sbp", "dbp", 
                  "cholesterol", "hdl", 
                  "triglycerides")

nhanes %>%
  select(!!blood_related)

nhanes %>%
  select(one_of(blood_related))

Selecting columns by logical

The select_if function allows you to pass functions which return logical statements. For instance you can select all string columns by using select_if(is.character). Similarly, you can add is.numeric, is.integer, is.double, is.logical, is.factor.

nhanes %>%
  select_if(is.character) %>%
  glimpse
Rows: 1,267
Columns: 1
$ id <chr> "NC0001", "NC0002", "NC0003", "NC0004", "NC0005", "NC0006", "NC0007…

Select the negation

You can also select the negation but in this case you will need to add a tilde ~ to ensure that you still pass a formula to select_if. The select_all/if/at functions require that a function or a formula is passed as an argument. If you have to add any negation or arguments, you will have to wrap your function inside funs() or add a tilde ~ before to remake it a function.

nhanes %>%
  select_if(~!is.numeric(.)) %>%
  glimpse
Rows: 1,267
Columns: 2
$ id  <chr> "NC0001", "NC0002", "NC0003", "NC0004", "NC0005", "NC0006", "NC000…
$ dob <date> NA, 1943-08-24, 1954-05-08, 1950-06-25, 1987-06-11, 1956-10-30, 1…
nhanes %>%
  select_if(funs(!is.numeric(.))) %>%
  glimpse
Rows: 1,267
Columns: 2
$ id  <chr> "NC0001", "NC0002", "NC0003", "NC0004", "NC0005", "NC0006", "NC000…
$ dob <date> NA, 1943-08-24, 1954-05-08, 1950-06-25, 1987-06-11, 1956-10-30, 1…

Selecting columns by multiple logical expressions

Similarly mean > 4 is not a function in itself, so you will need to add a tilde upfront, or wrap it inside funs() to turn the statement into a function.

nhanes %>%
    select_if(is.numeric) %>% 
    select_if(~mean(., na.rm=TRUE)>4 ) %>% 
    glimpse()

or shorter:

nhanes %>%
  select_if(~is.numeric(.) & mean(., na.rm=TRUE) > 4) %>% 
  glimpse()

Renaming columns

If you will be using a select() statement, you can rename straight in the select function. Pay attention to which comes first

nhanes %>%
  select(systolic_blood_pressure = sbp,
         diastolic_blood_pressure = dbp) %>%
  glimpse

If you want to retain all columns and therefore have no select() statement, you can rename by adding a rename() statement.

nhanes %>% 
    rename(systolic_blood_pressure = sbp,
           diastolic_blood_pressure = dbp) %>%
  glimpse

Reformatting all column names

The select_all() function allows changes to all columns, and takes a function as an argument.

nhanes %>%
  select_all(toupper) %>% 
  glimpse
nhanes %>%
  select_all(tolower) %>% 
  glimpse

Mutating columns: the basics

  • You can make new columns with the mutate() function. The options inside mutate are almost endless: pretty much anything that you can do to normal vectors, can be done inside a mutate() function.
  • Anything inside mutate can either be a new column (by giving mutate a new column name), or can replace the current column (by keeping the same column name).
  • One of the simplest options is a calculation based on values in other columns.
nhanes %>%
  mutate(weight_lbs = weight*2.2046) %>%
  glimpse

New columns can be made with aggregate functions such as average, median, max, min, sd, … The sample code makes two new columns: one showing the difference of each row versus the average sbp number, and one showing the difference versus the lowest number for sbp.

nhanes %>%
  select(id, sbp) %>%
  mutate(
    sbp_vs_AVG = sbp - round(mean(sbp), 1),
    sbp_vs_MIN = sbp - min(sbp)) %>% 
  glimpse

`mutate()` with `ifelse()`

The ifelse() function deserves a special mention because it is particularly useful if you don’t want to mutate the whole column in the same way. With ifelse(), you first specify a logical statement, afterwards what needs to happen if the statement returns TRUE, and lastly what needs to happen if it’s FALSE.

nhanes %>%
  select(id, sbp) %>%
  mutate(hypertension = ifelse(sbp > 130, 'Yes', 'No')) %>% 
  glimpse

Creating new multiple levels column

  • If you want more than two levels, it might be even easier to use case_when()
  • The arguments are evaluated in order, so only the rows where the first statement is not true will continue to be evaluated for the next statement. For everything that is left at the end just use the TRUE ~ “newname”.
nhanes %>%
  select(id, sbp) %>%
  mutate(status = case_when(
    sbp > 180 ~ "Hypertensive Crisis",
    sbp > 140 ~ "Hypertension Stage 2",
    sbp > 130 ~ "Hypertension Stage 1",
    sbp > 120 ~ "Elevated",
    TRUE ~ "Normal"))

Note

Calling dplyr verbs always outputs a new data frame, it does not alter the existing data frame

So to keep the changes, we have to reassign the data frame to be the output of the pipe!

EX:

nhanes <- nhanes %>%
  mutate(weight_lbs = weight*2.2046)

or

nhanes2 <- nhanes %>%
  mutate(weight_lbs = weight*2.2046)

Save the Cleaned Data to File System

  • Once you clean up the data, you may want to save the cleaned data to your file system, so that in the future you can load the cleaned data directly.
  • To save the data, you can use write.csv() function. Change the working directory
  • Use write.csv() to save the data
write.csv(nhanes2, "nhanes_clean.csv")